In this notebook we will work through some of the initial examples in the NLTK book by Steven Bird, Ewan Klein, and Edward Loper. You can follow along with the first chapter of the book at https://www.nltk.org/book/ch01.html.
In [ ]:
# First we import the NLTK library and download the data
# used in the examples in the book. The data is stored in
# a directory on the virtual machine but is accessible
# from your notebooks.
import nltk
nltk.download('book')  # fetch the "book" collection used by nltk.book
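If you are curious where that data lives, NLTK keeps a list of the directories it searches; printing it is an easy way to confirm the download location (an extra cell, not part of the book's sequence):
In [ ]:
# directories NLTK searches for corpora and other downloaded data
nltk.data.path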
In [ ]:
from nltk.book import *
In [ ]:
text5.concordance("lol")  # every occurrence of "lol" in the chat corpus, with surrounding context
In [ ]:
text1.similar("monstrous")  # words that appear in contexts similar to "monstrous"
In [ ]:
text2.common_contexts(["monstrous", "very"])  # contexts shared by both words
In [ ]:
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])  # where each word appears across the inaugural corpus
In [ ]:
print('number of tokens:', len(text3))  # counts words and punctuation symbols
In [ ]:
sorted(set(text3))[:20]  # first 20 items of the sorted vocabulary
In [ ]:
len(set(text3))  # vocabulary size: number of distinct tokens
In [ ]:
len(set(text3)) / len(text3)  # lexical diversity: distinct tokens / total tokens
In [ ]:
text3.count("smote")
In [ ]:
100 * text4.count('a') / len(text4)  # percentage of the text taken up by 'a'
In [ ]:
def lexical_diversity(text):
    return len(set(text)) / len(text)

def percentage(count, total):
    return 100 * count / total
In [ ]:
print(lexical_diversity(text3))
print(lexical_diversity(text5))
print(percentage(4, 5))
print(percentage(text4.count('a'), len(text4)))
In [ ]:
fdist1 = FreqDist(text1)  # frequency distribution over the tokens of Moby Dick
print(fdist1)
fdist1.most_common(50)  # the 50 most frequent tokens
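A FreqDist also behaves like a dictionary keyed by token, so we can look up the count of a single word directly (an extra cell; the book performs the same lookup for "whale"):
In [ ]:
print(fdist1['whale'])  # raw count of "whale" in the text
print(fdist1.N())       # total number of tokens counted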
In [ ]:
fdist1.plot(50, cumulative=True)  # cumulative frequency of the 50 most common tokens
In [ ]:
fdist1.hapaxes()  # hapaxes: words that occur only once in the text
In [ ]:
V = set(text1)
# words longer than 12 characters that occur more than 7 times
long_words = [w for w in V if len(w) > 12 and fdist1[w] > 7]
sorted(long_words)
In [ ]:
list(nltk.bigrams(['more', 'is', 'said', 'than', 'done']))  # consecutive word pairs
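NLTK provides trigrams as well; a quick illustration with the same sentence (an extra cell, not in the book's sequence):
In [ ]:
list(nltk.trigrams(['more', 'is', 'said', 'than', 'done']))  # consecutive word triples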
In [ ]:
text4.collocations()  # pairs of words that occur together unusually often
In [ ]:
text8.collocations()
In [ ]:
fdist = FreqDist(len(w) for w in text1)  # distribution of word lengths
fdist
In [ ]:
print(fdist.most_common())  # (word length, count) pairs
print(fdist.max())  # the most frequent word length
print(fdist[3])  # number of three-letter tokens
print(fdist.freq(3))  # proportion of tokens that are three letters long
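As a quick check (an extra cell), freq() is just the raw count divided by the total number of samples, which FreqDist exposes as N():
In [ ]:
print(fdist.N())  # total number of samples (tokens) counted
print(fdist[3] / fdist.N())  # same value as fdist.freq(3)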
In [ ]:
# words containing 'cie' or 'cei'
tricky = sorted(w for w in set(text2) if 'cie' in w or 'cei' in w)
for word in tricky:
    print(word, end=' ')
Keep following along with the rest of the examples in the NLTK book to get familiar with a variety of Natural Language Processing (NLP) techniques using Python and the NLTK library.